Proteins participating in a protein-protein interaction network can be grouped into homology 
classes following their common ancestry. Proteins added to the network correspond to genes added 
to the classes, so that the dynamics of the two objects are intrinsically linked. Here, we first intro- 
duce a statistical model describing the joint growth of the network and the partitioning of nodes 
into classes, which is studied through a combined mean-field and simulation approach. We then 
employ this unified framework to address the specific issue of the age dependence of protein inter- 
actions, through the definition of three different node wiring/divergence schemes. Comparison with 
empirical data indicates that an age-dependent divergence move is necessary in order to reproduce 
the basic topological observables together with the age correlation between interacting nodes visible 
in empirical data. We also discuss the possibility of nontrivial joint partition/topology observables. 



I. INTRODUCTION 

The protein-protein interaction (PPI) network represents the physical interactions between proteins in a
cell [1]. The topological properties of this complex network provide an effective overview of the protein-protein
interactions coded by a genome, with implications for the analysis of signaling and metabolic pathways [2].

In the course of evolution, a genome acquires new genes, and thus new proteins, by different evolutionary
processes [3, 4], which include gene duplication and horizontal transfers. These processes define groups of
proteins with the same common ancestor, termed homology classes. Notably, homology classes follow well-defined
quantitative laws with specific mathematical properties [5-8], dependent only on genome size and not on
further details of a genome's evolutionary history [8].

Following gene duplications [9], proteins belonging to the same homology class can modify their binding
interfaces to conserve ancient interactions, lose them, or evolve new ones. This process generates new PPI
network configurations, which are subject to selective pressures of different kinds [10-12] and allow the
construction of increasingly complex biomolecular machinery [13-15]. This mechanism of "duplication-divergence"
has inspired a thread of graph-growth modeling work within the physics and computational biology
communities [16-22]. Generally speaking, these models generate random graph ensembles by iteratively adding
new nodes that are initially copies of existing ones (and thus interact with all their binding partners) and
subsequently lose and/or rewire interactions according to a set of simplified rules. This
basic mechanism produces graph topologies resembling
empirical PPI networks in many aspects. Comparison
of model predictions and empirical data leads to the hy- 
pothesis that duplication-divergence can (at least in part) 
explain PPI network topologies [3TJ [Ml [M] , starting from 
the basic observation that duplicate proteins are often involved
in similar protein-protein interactions [13-15].

While it appears that gene duplication plays a role in shaping PPI networks through evolutionary time [25],
many questions remain open. For example, it has been pointed out that the duplication-age profiles
naturally emerging from duplication-divergence models do not resemble empirical data, and that, quite
reasonably, the availability of binding interfaces could impose additional relevant constraints [26-28].
Accordingly, alternative models have been proposed, where the wiring rules account for these
constraints [26]. Additionally, according to most of these models, "collapsing" multiple homologous
neighbors of a protein into one neighbor should make the broad degree distribution considerably narrower,
which does not seem to be the case in empirical data [30]. Thus, the actual growth mechanism of PPI
networks is still under debate, and it is unclear how much duplication-divergence versus other
constraints can account for the topology of empirical
PPI networks [2SJ [25] • Additionally, duplication- 
divergence models typically neglect the process of homol- 
ogy classes expanding and being formed within a genome, 
and thus cannot describe how PPI network links are dis- 
tributed among homology classes. However, the subdivi- 
sion of genes into homology classes could constitute an- 
other relevant constraint for the PPI network's structure 
and should not be neglected a priori. 

This work addresses the above issues through a modeling approach. We consider a (null) statistical
graph-growth model describing the joint growth of the PPI network and of the homology-class structure.
The output of the model is a growing graph, whose nodes are partitioned into equivalence classes
following the empirical size distributions of protein classes. The model defines a framework for testing
alternative mechanisms of network growth, where duplication-divergence can have different weights
during the process and thus different consequences for the final properties of the network. Within this
setting, we ask which ingredients can account for the joint growth of homology classes and network
while reproducing the main empirical observables, such as degree distribution, degree correlation, and
correlation between interacting duplication-age groups. In our analysis we find in particular that
reproducing the empirical age correlation between interacting nodes requires a heavy bias on the
duplication-divergence process, which must correspond to additional constraints of functional or
physical origin.



II. BACKGROUND

A. Network growth by duplication-divergence 

Perhaps the simplest PPI network growth model incorporating the basic moves of duplication and divergence
(DD) was introduced and studied in [19]. In this model, the network grows by node duplication and subsequent
deletion of some of the duplicate links with a prescribed probability (divergence). More precisely, at each
step a randomly chosen network node is copied, initially inheriting all the interactions of the original
node, and in a second substep the new node's links are deleted independently with probability $1 - \sigma$.
If no link is left after divergence, the duplicate node itself is deleted, so that the network remains
connected throughout its evolution. This process is completely asymmetric, meaning that the parent node
(the one chosen for duplication) does not lose any connection, and the divergence process only affects the
daughter. More general variants have been proposed, for instance by relaxing the requirement of complete
asymmetry and of single-gene duplication [21], or by introducing rewiring between existing nodes (which can
even become dominant in shaping the network [30]). For simplicity, we will restrict ourselves to the
one-parameter model in the following.
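The duplication and divergence moves described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the adjacency-dictionary representation, the function names `dd_step` and `grow_dd`, and the two-node seed network are choices made here for illustration and are not taken from [19].

```python
import random

def dd_step(adj, sigma, rng):
    """One duplication-divergence attempt; returns True if the duplicate survives."""
    parent = rng.choice(list(adj))
    # Divergence: each inherited link is kept with probability sigma
    # (i.e., deleted with probability 1 - sigma).
    kept = [nb for nb in adj[parent] if rng.random() < sigma]
    if not kept:
        # The duplicate lost all its links and is discarded.
        return False
    new = len(adj)  # node labels are 0, 1, 2, ... (assumption of this sketch)
    adj[new] = set(kept)
    for nb in kept:
        adj[nb].add(new)
    return True

def grow_dd(n_nodes, sigma, seed=0):
    """Grow a network to n_nodes nodes, starting from a single link (assumed seed)."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}
    while len(adj) < n_nodes:
        dd_step(adj, sigma, rng)
    return adj
```

Calling `grow_dd(200, 0.4)` returns a dictionary mapping each node to the set of its neighbors; the parent node never loses links, which reflects the complete asymmetry of the move.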

One of the main features of this model is that the described mechanism leads to an effective preferential
attachment principle, since high-degree nodes are more likely to have a neighbor being duplicated by random
choice. Specifically, the probability of a new link being attached to a node of degree $k$ is proportional
to $k$. As a consequence, the degree distribution of the growing network develops power-law tails
$\sim k^{-\gamma}$ for large degrees [19]. Exponents in the range $\gamma \in [2,3]$ are realized by choices
of $\sigma \in (0, 1/2]$. Comparison with available subsets of empirical PPI networks yields values of the
link-retention probability $\sigma$ around $0.40 \pm 0.05$ for S. cerevisiae, D. melanogaster, and
H. sapiens [19]. The average total number of links $L(N)$ as a function of the network size $N$ can also be
predicted by mean-field calculations (see Section IV A).


B. Homology class partitioning by the Chinese
restaurant process



Duplication plays a fundamental role in the evolution 
of homology classes as well [7] , as it constitutes the main 
drive for class expansion, at least in eukaryotes. Equally, 
a genome "innovation" move (for instance by horizontal 
transfer) causes the creation of new homology classes. 

A simple class of partitioning processes incorporating the basic moves of class expansion and innovation
is capable of explaining the scaling laws observed in domain-class partitioning [8]. The paradigm of these
models is the so-called "Chinese Restaurant Process" (CRP) [5], [31-33], which is the one that will be used
here. In this process, at each iteration the genome goes from having $n$ to $n+1$ genes, and either a new
class is created (with probability $p_{\mathrm{new}}$) or a domain is added to an existing class (with
probability $p_{\mathrm{old}} = 1 - p_{\mathrm{new}}$). A crucial ingredient of the CRP is the dependence of
$p_{\mathrm{new}}$ and $p_{\mathrm{old}}$ on the size of the growing proteome, whose effect is to reproduce
in the model the observed sublinear scaling of the number of domain classes $F(N)$ with genome size $N$:

$$p_{\mathrm{new}} = \frac{\theta + \alpha F(N)}{N + \theta}, \qquad p_{\mathrm{old}} = \frac{N - \alpha F(N)}{N + \theta}, \tag{1}$$

where $\alpha \in (0, 1)$ and $\theta > 0$ are parameters of the model.
(The extreme cases $\alpha = 0, 1$ could be included, but we
will neglect them here for clarity.) The per-class probability
of duplication is defined as



$$p_{\mathrm{old}}^{(i)} = \frac{j_i - \alpha}{N + \theta}, \tag{2}$$



where $j_i$ is the size of the $i$-th class. This corresponds to an asymptotically uniform extraction, which
realizes an effective preferential attachment principle. The parameter $\alpha$ describes the dominance of
innovation over duplication, while $\theta$ is a fixed size scale at which preferential attachment sets in.
Mean-field calculations, supported by simulations, show [8] that the asymptotic behaviors of the class-size
distribution $f(j, N)$ and of the total number of classes $F(N)$ are



$$f(j, N) \sim j^{-(1+\alpha)}, \qquad F(N) \sim N^{\alpha}, \tag{3}$$



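for large $N$ and $j$. As a consequence, $p_{\mathrm{new}}$ and $p_{\mathrm{old}}$ scale as

$$p_{\mathrm{new}} \sim \alpha N^{\alpha - 1}, \qquad p_{\mathrm{old}} \sim 1 - \alpha N^{\alpha - 1}. \tag{4}$$

These predictions are in good qualitative agreement with
empirical data for prokaryotic proteomes [7, 8].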
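As an illustration of the partitioning process, the following Python sketch simulates the class sizes, using $p_{\mathrm{new}} = (\theta + \alpha F)/(N + \theta)$, which follows from summing the per-class probabilities $(j_i - \alpha)/(N + \theta)$ over all classes. The function name `crp_class_sizes` and the choice of starting from a single gene in a single class are assumptions of this sketch, not part of the original analysis.

```python
import random

def crp_class_sizes(n_genes, alpha, theta, seed=0):
    """Simulate the CRP and return the list of class sizes j_i."""
    rng = random.Random(seed)
    sizes = [1]  # assumed start: one gene founding the first class
    n = 1
    while n < n_genes:
        n_classes = len(sizes)
        p_new = (theta + alpha * n_classes) / (n + theta)
        if rng.random() < p_new:
            # Innovation: the new gene founds a new class.
            sizes.append(1)
        else:
            # Duplication: class i is chosen with weight (j_i - alpha).
            r = rng.random() * (n - alpha * n_classes)
            acc = 0.0
            for i, j in enumerate(sizes):
                acc += j - alpha
                if r < acc:
                    sizes[i] += 1
                    break
            else:
                sizes[-1] += 1  # guard against floating-point round-off
        n += 1
    return sizes
```

The number of classes, `len(crp_class_sizes(N, alpha, theta))`, can then be recorded for several values of `N` to inspect the sublinear growth of $F(N)$.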
III. MODEL AND METHODS 

A. Definition of a statistical model combining 
genome partitioning and network growth 

As we discussed, from a simplifying perspective, the 
growth of PPI networks and genome partitioning in ho- 
mology classes are produced by essentially the same basic 
evolutionary moves of innovation and duplication on the 
genes. For this reason, the model proposed here is de- 
fined by abstract realizations of these basic moves on the 
level of both the network and the homology classes. This 
is achieved by a simple coupling between the duplication- 
divergence model of network growth and the CRP parti- 
tioning, as reviewed in the Background section. In par- 
ticular, a class expansion move is associated with a net- 
work duplication move, and a proteome innovation move 
with a network move wiring the new node to the existing 
network. Thus, the model could be termed "Duplication 
Divergence Innovation Wiring" (DDIW), and describes 
the growth of homology classes and PPI network jointly. 

Let $p_{\mathrm{new}}$, $p_{\mathrm{old}}^{(i)}$, and $p_{\mathrm{old}} = \sum_i p_{\mathrm{old}}^{(i)}$ be
defined as in (1) and (2) in terms of the number of classes $F(N)$ and the size $j_i$ of the $i$-th class.
The basic data structure of the model includes the topology of the PPI network and the information on the
partitioning of its nodes (see Figure 1). Given a proteome/network of size $N$, the growth process is
defined by the following two rules acting on the classes and on the graph topology.

1. a: DUPLICATION (classes) Choose a class $i$ with probability $p_{\mathrm{old}}^{(i)}$, and duplicate a
randomly chosen target node inside class $i$.
b: DIVERGENCE (network) Attach the new node to each of the target's neighbors independently with
probability $\sigma$.

2. a: INNOVATION (classes) Otherwise (i.e., with probability $p_{\mathrm{new}}$), create a new node in a
new class.
b: WIRING (network) Attach the new node to one or more nodes in the existing network, independently of
their classes. (The additional rules describing this step are listed in Sec. III B.)



Altogether, there are three parameters governing the
dynamics, $\alpha \in (0, 1)$, $\theta > 0$, and $\sigma \in (0, 1]$. Notice that,
while the network dynamics is dependent on the configu- 
ration of the partitioning, the evolution of the latter is not 
affected by what happens at the network level. There- 
fore, partitioning is assured by definition to reproduce 
the CRP predictions for all choices of the parameters. 
Notice that class-expansion can also occur by horizontal 
transfer of members of an existing homology class [51], 
but we will disregard this process here. In fact, while this 
mechanism is widespread in bacteria, we found that there 
was no need to incorporate it explicitly in the model in 
order to have a good fit with data for both networks and 
homology classes. 
Figure 1: (Color online) Illustration of the moves in the DDIW model. At each step either a new class
containing one node is added, and the new node is linked to one or more existing nodes (innovation-wiring,
done with probability $p_{\mathrm{new}}$), or a randomly chosen node is duplicated inside a class and the
replica's links are activated independently with probability $\sigma$ (duplication-divergence, done with
probability $p_{\mathrm{old}}$). Filled circles are nodes, lines are links; large circles are homology
classes; the red node and its dashed links are the results of a duplication-divergence move; the blue node
and its dot-dashed links are the results of an innovation-wiring move.



Technically, we choose a slightly different divergence rule from the model of Ref. [19]. In order for
duplication to always be successful (i.e., no node being left without any links), we require that one
randomly chosen link be conserved and that divergence be performed on the remaining ones; that is, the
model assumes that each duplicated node is preserved by selection and cannot be disconnected from the
existing network. The same hypothesis holds for the original model, but there it is implemented by
removing the disconnected nodes. The different implementation implies that the divergence rule explained
in Sec. II A yields a degree-dependent probability of duplication, since less connected nodes are more
prone to have all their links disconnected; the rule used here, instead, assigns the same probability of
duplication to every node. Despite this difference, the modified model incorporates the same basic
mechanisms as the previous one, and we verified that it leads to the same qualitative results (some
features also match quantitatively, see Sec. IV A). The main rationale behind this choice is a
simplification of the mean-field equations, as it makes it unnecessary to estimate the number of deleted
nodes.
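The modified divergence rule can be illustrated with a short Python sketch. Under this rule a duplicate of a node of degree $k$ retains one link with certainty and each of the other $k - 1$ links with probability $\sigma$, so it keeps $1 + (k - 1)\sigma$ links on average. The function names `diverge_keep_one` and `mean_kept` are hypothetical helpers introduced here for illustration only.

```python
import random

def diverge_keep_one(parent_neighbors, sigma, rng):
    """Return the set of neighbors retained by the duplicate node."""
    nbs = list(parent_neighbors)
    anchor = rng.choice(nbs)  # one randomly chosen link is always conserved
    kept = {anchor}
    for nb in nbs:
        # Every other inherited link is kept independently with probability sigma.
        if nb != anchor and rng.random() < sigma:
            kept.add(nb)
    return kept

def mean_kept(k, sigma, trials=20000, seed=1):
    """Monte Carlo estimate of the mean number of links kept for parent degree k."""
    rng = random.Random(seed)
    total = sum(len(diverge_keep_one(range(k), sigma, rng)) for _ in range(trials))
    return total / trials
```

For example, `mean_kept(6, 0.4)` should be close to $1 + 5 \times 0.4 = 3$, up to sampling noise.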

The initial condition is chosen as the complete 3-graph, which is the smallest non-bipartite network.
Results do not change appreciably when starting from different small networks (we did not systematically
study the dependence of the results on initial conditions built as large networks).

We choose to exclude self-interactions from the model, 
as they play a biologically distinct role in the net- 
work, and they probably deserve to be considered sep- 
arately [H]. 
B. Model variants allowing to study the effect of 
different growth mechanisms on the topology 

The wiring rule is not completely specified by the definitions above. Its implementation is given in the
following. At the network level, the rules concerning the topology can be modified without affecting the
basic structure of the model. Here, we study a minimal version and consider different variants of these
rules, which allow us to address the recently formulated problem of the age dependence of empirical
interactions [26].

We start by focusing on the wiring move. Once introduced, the new node can be attached to a single node
chosen in the existing network by a preferential attachment (PA) or anti-preferential attachment (AP)
principle with respect to the old node's degree. The former alternative describes the tendency of new,
specialized proteins to interact more likely with old proteins that perform basic tasks; the latter
reflects the relationship between the binding probability and the available interaction surface of
existing nodes [26]. Alternatively, the new node can be wired to a size-dependent or
configuration-dependent number $l$ of existing nodes.

Other modifications are possible for the divergence move, for example by making the link-retention
probability $\sigma$ depend on the current configuration of the network or on the age difference between
the two nodes that are connected by the link considered by divergence. Here we consider three main
variants [42] (see Fig. 2):

A DDIW + AP. The wiring move establishes a single new link between the new node and an existing node $i$
of degree $k_i$, chosen with probability proportional to $1/k_i$. This anti-preferential rule reflects the
growth of the binding probability with the available interaction surface.

B DDIW + extensive AP (EAP). The wiring move attaches the new node to $l = [\gamma \langle k \rangle]$
existing nodes, chosen with anti-preferential attachment; $[\gamma \langle k \rangle]$ is the closest
integer to a fraction $\gamma \in (0, 1)$ of the mean degree in the present configuration.

C Age-dependent DDIW (A-DDIW). The wiring move is the same as in A. The divergence step implements a kind
of preferential attachment which takes into account the node's age in the following way. Let $a_i$ be the
age of node $i$, i.e., the number of iterations the process underwent since the node was born. A link to
node $i$ inherited from the target node is kept with probability
1 if $a_i < \sigma N$, where $N$ is the size of the network, and
with probability otherwise. This rule implements 
non-neutral selective pressure towards maintaining an- 
cient well-established basic cellular machinery. 
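To make the interplay of the two moves concrete, the following Python sketch grows a network according to variant A (DDIW+AP): innovation with a single anti-preferential link, and duplication with one conserved link plus independent retention of the others with probability $\sigma$. It is a minimal sketch under stated assumptions: the function name `grow_ddiw_ap`, the data structures, and the assignment of the three nodes of the initial complete 3-graph to three separate classes are choices made here for illustration.

```python
import random

def grow_ddiw_ap(n_nodes, alpha, theta, sigma, seed=0):
    """Grow a DDIW+AP network; returns (adjacency dict, list of classes)."""
    rng = random.Random(seed)
    # Initial condition: complete 3-graph.
    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
    # Assumption of this sketch: each initial node founds its own class.
    classes = [[0], [1], [2]]
    while len(adj) < n_nodes:
        n, n_classes = len(adj), len(classes)
        new = n
        p_new = (theta + alpha * n_classes) / (n + theta)
        if rng.random() < p_new:
            # INNOVATION + WIRING: one link, target chosen with weight 1/k_i.
            nodes = list(adj)
            weights = [1.0 / len(adj[v]) for v in nodes]
            kept = {rng.choices(nodes, weights=weights, k=1)[0]}
            classes.append([new])
        else:
            # DUPLICATION: class i chosen with weight (j_i - alpha),
            # then a uniformly random member is duplicated.
            cls = rng.choices(classes,
                              weights=[len(c) - alpha for c in classes], k=1)[0]
            parent = rng.choice(cls)
            # DIVERGENCE: conserve one random link, keep the others w.p. sigma.
            nbs = list(adj[parent])
            anchor = rng.choice(nbs)
            kept = {anchor} | {v for v in nbs
                               if v != anchor and rng.random() < sigma}
            cls.append(new)
        adj[new] = kept
        for v in kept:
            adj[v].add(new)
    return adj, classes
```

A call such as `adj, classes = grow_ddiw_ap(300, 0.43, 1.0, 0.4)` yields both the network and its partition, so that topological and class-size observables can be measured on the same realization.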




Figure 2: (Color online) Variants of the model. Symbols and
colors have the same meaning as in Fig. 1. (A) DDIW+AP
(anti-preferential-attachment innovation with a single link). 
During innovation, the new node carries one new link, whose 
target node is chosen with probability inversely proportional 
to its degree. (B) DDIW+EAP (anti-preferential-attachment 
innovation, with multiple links). During innovation, the new 
node carries a number of links proportional to the current av- 
erage degree. (C) A-DDIW (age-dependent divergence). Dur- 
ing divergence of a duplicated node, the probability of keeping 
a link depends on the difference in age between the two nodes 
linked (higher age differences corresponding to lower proba- 
bilities). 



C. Empirical Data Sets and Data Analysis 
Methods 

Data for protein binding are obtained from the most recent (October 2011) release of the Database of
Interacting Proteins (DIP) [35]. We filter out self-interactions between single proteins and interactions
between proteins expressed by different genomes; different strains are considered as different organisms.
Moreover, we exclude all virus data and all networks with fewer than 10 nodes. We end up with 1 archaeon,
14 bacteria, and 7 eukaryotes; a list of all organisms considered in the study of network topology is
presented in Table I, together with the observed number of proteins $N$ and interactions $L$. Notice that
the networks we can construct from DIP only include subsets of the full proteomes. For example, the
C. elegans network in our dataset is smaller than that of S. cerevisiae, despite its genome being much
larger, possibly creating significant undersampling problems in the data. See Sec. V for a discussion of
this issue.

Homology classes are built starting from the SUPERFAMILY database for domain assignment [36]. We
reconstruct the domain architectures as ordered lists of domains and gaps; a gap is defined as a
subsequence of 100 or more amino acids (AA) not assigned to any domain [37]. Two proteins are in the same
homology class if their architectures match exactly. We also tested a more relaxed criterion (allowing for
repetitions of domain architectures), and obtained the same results as those presented in the following
for the stricter criterion. Moreover, we also considered data restricted to longest transcripts in
eukaryotes, finding no difference in the scaling (we remark that longest-transcript data in the dataset
are very incomplete, so we will not include them in the forthcoming analysis). We filter out genomes with
more than 19 000 assignments; altogether, we work with data for 1384 organisms (87 archaea, 1077 bacteria,
and 220 eukaryotes) for the homology classes, but only 22 networks with sufficiently large sampling of the
interactions.

Besides network topology and homology classes, we are interested in the evolutionary ages of proteins. For
the proteome of S. cerevisiae, we use data from Wapinski et al. [38], where duplication events for a number
of genes of S. cerevisiae are divided into ten classes, labeled A, B, C, D, E, WGD, G, H, I, J, depending
on when in the evolutionary history of Ascomycota they occurred (class A being the most recent). We further
group these classes into four superclasses (labeled G1-G4), keeping the whole-genome duplication (WGD)
alone, due to the abundance of its elements:

G1 = I + J
G2 = G + H
G3 = WGD
G4 = A + B + C + D + E.

By this procedure, we assign 210 genes to age group G1, 85 to G2, 691 to G3 (WGD), and 91 to G4. The age
of a protein is defined as the superclass of the oldest duplication event in which it is reported to be
involved. It should be noted that the WGD has a different phenomenology from the single-gene duplication
events considered here; we do not exclude it from our data, but modeling it is beyond the scope of the
present work (see [22]). In
order to evaluate the history dependence of protein interactions, we use the interaction density
$D_{m,n}$ between two age groups $m$ and $n$ as an indicator of age correlation. It is defined,
following [25], as

$$D_{m,n} = \log_2 \frac{L_{m,n}\, N(N-1)}{2 L\, E_{m,n}}, \tag{5}$$

where $L_{m,n}$ is the number of links between the age groups $m$ and $n$, and $E_{m,n}$ is the number of
possible links between nodes of the two groups, which only depends on the number of nodes in $m$ and $n$.
The average interaction density gradient, defined as [25]

ΔD = … n − 2 … m<n … (6)

measures the overall correlation present between the ages of proteins; a positive value indicates that
newer nodes preferentially link with newer nodes. We will use the sign of $\Delta D$ as a marker of
correlation or anti-correlation between ages.

Fits of data against nonlinear analytic expressions are performed by minimization of the squared
residuals through the standard Levenberg-Marquardt method, and are systematically checked for stability
under the introduction of a cutoff on small-size data.



IV. RESULTS

We ask under which conditions the model or its variants fulfill the following requirements. First, the
model should qualitatively reproduce the features of both the duplication-divergence and CRP "pure"
models. Second, since it describes the enriched data structure of network plus homology classes, it should
predict the behavior of joint topology-partition observables, including the history dependence of
interactions.

All variants of the DDIW model reproduce the same homology-class scaling as the pure CRP, essentially
because the class partitioning is not affected by the network dynamics by definition. A simple scaling
argument suggests that the duplication-divergence predictions are expected to be recovered for large $N$,
since the scaling of $p_{\mathrm{new}}$ and $p_{\mathrm{old}}$, Eq. (4), shows that duplication becomes
dominant in this regime. Therefore, the model is expected to behave as pure duplication-divergence in the
large-$N$ limit; it remains to clarify what happens at intermediate values of $N$. In the following
subsections we address some of these questions; the large-$N$ behavior is clarified by mean-field
techniques, while finite values of $N$ are studied by means of numerical simulations. The analysis of how
the partitioning into homology classes correlates with the network structure will be briefly addressed in
Sec. V, but its systematic study will be left to future work.



A. Mean-field theory accurately predicts scaling of
the total number of links



Mean-field calculations give reliable estimates for the 
behavior of the duplication-divergence network growth 
model and for the class-expansion innovation model sep- 
arately [H , therefore it makes sense to apply the same 
procedure to the joint model. The mean-field approach 
essentially consists in neglecting the fluctuations due to 
the statistical nature of the models and writing "macroscopic"
differential equations for the average quantities,
which can be treated analytically. In this section we
will use this tool to study the average total number of 
links L(N) as a function of the number of nodes N for 
the variants of the joint-evolution model described in the 
previous section. In principle, other characteristics of 
the network may be accessible through mean-field calcu- 
lations, such as the degree distribution, but we will not 
treat them here. 

For the duplication-divergence model alone (in the variant defined in Sec. III A), the simplification we
introduced allows us to write a slightly more general expression for $L(N)$ than that obtained in [19].
Let $N_k$ be the average number of nodes with $k$ links in a network of size $N$ (the average is taken
over all realizations of the stochastic process up to size $N$). Clearly,



$$\sum_k N_k = N \tag{7}$$

and

$$\sum_k k\, N_k = 2 L(N), \tag{8}$$



where the sums extend over all possible values of
the degree $k$ (say, from 1 to $\infty$). $L(N)$ varies at each
duplication following the mean-field equation



$$\Delta L(N) \simeq \sum_k \frac{N_k}{N} \left[ 1 + (k - 1)\sigma \right], \tag{9}$$



where $\Delta L(N) = L(N + 1) - L(N)$. The summand takes into account the duplication of a node of degree
$k$, which is performed with probability $N_k/N$. The term in square brackets reflects the fact that by
definition at least one of the links is maintained, while the other $k - 1$ links are kept independently
with probability $\sigma$. Performing the sum by applying identities (7) and (8) yields



$$\Delta L(N) \simeq (1 - \sigma) + 2\sigma\, \frac{L(N)}{N}. \tag{10}$$



This can be approximated by the following (large N) 
differential equation 



$$\frac{dL}{dN} = (1 - \sigma) + 2\sigma\, \frac{L}{N}. \tag{11}$$



Solving this equation with a formal initial condition
$L(N_0) = L_0$ gives the solution



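$$L(N) = \frac{1 - \sigma}{1 - 2\sigma}\, N + \left( L_0 - \frac{1 - \sigma}{1 - 2\sigma}\, N_0 \right) \left( \frac{N}{N_0} \right)^{2\sigma}. \tag{12}$$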



In the following, we will fix the initial condition to the complete 3-graph ($L(3) = 3$), in order to
avoid the proliferation of irrelevant parameters. Two regimes are apparent, in which the first or the
second term dominates, corresponding to $\sigma < 1/2$ and $\sigma > 1/2$ respectively. Notice the
alternating-sign pattern of the corrections to scaling, which can cause the observation of a small-size
effective exponent higher than both 1 and $2\sigma$ (see Sec. IV B). By taking the limit
$\sigma \to 1/2$ one has $L(N) = \frac{1}{2} N \log N + O(N)$, thus recovering the three different regimes
of the original DD model [19]. Figure 3 shows that mean-field predictions correctly reproduce the results
of simulations, even for fairly small values of $N$; small deviations from mean-field appear only for
large values of $\sigma$, which are not very relevant empirically, as the link density would be too high
compared to empirical data.
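The mean-field result can be checked numerically with a short Python sketch that compares the closed-form solution of $dL/dN = (1 - \sigma) + 2\sigma L/N$ with a direct iteration of the discrete update $\Delta L(N) = (1 - \sigma) + 2\sigma L(N)/N$, both started from the complete 3-graph ($N_0 = 3$, $L_0 = 3$). The function names are hypothetical and chosen here for illustration; the closed form is only valid for $\sigma \neq 1/2$.

```python
def links_closed_form(n, sigma, n0=3, l0=3.0):
    """Solution of dL/dN = (1 - sigma) + 2*sigma*L/N with L(n0) = l0 (sigma != 1/2)."""
    c = (1.0 - sigma) / (1.0 - 2.0 * sigma)
    return c * n + (l0 - c * n0) * (n / n0) ** (2.0 * sigma)

def links_recursion(n, sigma, n0=3, l0=3.0):
    """Iterate L(N+1) = L(N) + (1 - sigma) + 2*sigma*L(N)/N from N = n0 up to n."""
    links = l0
    for size in range(n0, n):
        links += (1.0 - sigma) + 2.0 * sigma * links / size
    return links
```

For small $\sigma$ the two functions give nearly identical values at large $N$, since the linear term dominates; they differ mainly in the prefactor of the $N^{2\sigma}$ term, which is affected by replacing the difference equation with a differential one.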

Figure 3: (Color online) Average total number of links $L(N)$ as a function of network size for the pure
duplication-divergence model. Solid lines show the mean-field prediction, while symbols are the results
of numerical simulations (100 realizations); error bars are smaller than symbols. Triangles correspond to
$\sigma = 0.2$, diamonds to 0.4, squares to 0.6, circles to 0.8.

We now consider the different variants of the joint DDIW model. The increase in the total number of links
at each step is given either by $l$ (if the innovation move is chosen) or by the same sum as in (9) (if
the duplication move is chosen). We will not consider variant C, since in this case solving the
mean-field equation for the number of links requires knowledge of the node-age distribution in the
network. Thus, for the first two variants we have

$$\Delta L(N) \simeq p_{\mathrm{new}}\, l(N) + p_{\mathrm{old}} \sum_k \frac{N_k}{N} \left[ 1 + (k - 1)\sigma \right],$$

where $l(N)$ is the average of $l$ over realizations of the process up to size $N$. By plugging in the
asymptotic forms (4) and taking the continuum approximation as in (10), we obtain


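$$\frac{dL}{dN} = \alpha N^{\alpha - 1}\, l(N) + \left( 1 - \alpha N^{\alpha - 1} \right) \left[ 1 - \sigma + 2\sigma\, \frac{L}{N} \right], \tag{13}$$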



which has to be solved separately for the two cases $l(N) = 1$ (DDIW+AP, variant A) and
$l(N) = \gamma\, 2L/N$ (DDIW+EAP, variant B). The solution is presented in some detail in the Appendix,
and we concentrate here on the asymptotic behavior. Up to exponential corrections of the form
$\exp(N^{-\eta})$ with $\eta > 0$, the number of links scales as

$$L(N) \sim a N^{2\sigma} + b N \tag{14}$$

for variant A and as

$$L(N) \sim c N + d N^{\alpha} + e N^{2\sigma} \tag{15}$$

for variant B; $a$ and $b$ are functions of $\alpha$ and $\sigma$, while $c$, $d$, and $e$ are functions
of $\alpha$, $\sigma$, and $\gamma$. The exponential corrections are proportional to
$\exp(p_{\mathrm{new}})$, which indicates the influence the partitioning process has on the early stages
of the growth process. Figure 4 shows a comparison between mean-field results and numerical simulations.
Deviations are apparent for $(\alpha, \sigma) = (0.6, 0.6)$ and $(0.2, 0.6)$ in the DDIW+EAP variant, but
theoretical predictions are accurate for other values and for the DDIW+AP variant. The structure of the
power-law corrections to scaling is similar to that of the pure DD model, and as long as
$\alpha < 2\sigma$ (which is the case for the universal fits to empirical data presented in Sec. IV B)
the asymptotic behavior only depends on $\sigma$, up to the sub-leading order. This suggests that the
scaling behavior of the hybrid DDIW model is to a certain extent robust with respect to the details of
the innovation dynamics.
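The robustness of the leading behavior can be explored with a small numerical sketch for variant A, where $l(N) = 1$. The Python function below iterates the discrete mean-field update with $p_{\mathrm{new}} = \alpha N^{\alpha - 1}$ and unit steps in $N$; the function name, the unit-step scheme, and the initial condition $L(3) = 3$ are assumptions of this sketch rather than the procedure of the Appendix.

```python
def integrate_links_ap(n_max, alpha, sigma, n0=3, l0=3.0):
    """Iterate the mean-field update for DDIW+AP (l = 1) from N = n0 up to n_max."""
    links = l0
    for n in range(n0, n_max):
        # Asymptotic innovation probability, capped at 1 for safety at small N.
        p_new = min(1.0, alpha * n ** (alpha - 1.0))
        # Innovation adds one link; duplication adds (1 - sigma) + 2*sigma*L/N on average.
        links += p_new * 1.0 + (1.0 - p_new) * ((1.0 - sigma) + 2.0 * sigma * links / n)
    return links
```

For $\sigma < 1/2$ the ratio `integrate_links_ap(N, alpha, sigma) / N` is expected to approach $(1 - \sigma)/(1 - 2\sigma)$ at large $N$, the same linear coefficient as in the pure duplication-divergence model, with slowly decaying corrections.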

Concerning the scaling of the number of links in variant C (A-DDIW, Fig. 4), note that in this case the
definition of $\sigma$ does not allow one to interpret this parameter as the average fraction of links
retained after node duplication. This is due to a nontrivial correlation between node age and node
degree, which is not straightforward to include in the mean-field calculation. Nevertheless, numerical
simulations indicate that the asymptotic behavior of $L(N)$ derived for variant A (DDIW+AP) also holds
for variant C up to a rescaling of $\sigma$. This can be seen in Fig. 4, where the mean-field predictions
are compared with numerical results for the rescaled values



B. Scaling of the number of links and classes as 
functions of genome size is reproduced by universal 
parameters independent of the model variant 

Having established that the scaling for the number of links is captured by simple mean-field estimates
and indicates well-defined parameter regimes, we constrain the parameters by comparing to the available
empirical data. Specifically, we fix the three parameters $\alpha$, $\theta$, and $\sigma$ by fitting the
mean-field expressions against data for homology classes and for PPI networks.

The calculations presented in the previous section and in the Appendix are not easily extendable to
finite values of $\theta$; they are valid in the asymptotic limit or when $\theta = 0$. Nonetheless, a
corrected expression of $F(N)$ for the case $\theta > 0$ and $F(1) = 1$ can be obtained (see [5]), and it
is the one we use here to fit the number of homology classes as a function of genome size,



$$\alpha F(N) = \frac{\alpha + \theta}{(1 + \theta)^{\alpha}} \, (N + \theta)^{\alpha} - \theta . \qquad (16)$$



We perform the fits on the empirical dataset for homology
classes defined by protein domain architectures, described
in Sec. III C. By taking into account all data,
we obtain α ≈ 0.42 and θ ≈ 124. The estimates change
slightly when a cutoff is imposed, since above N ≈ 1000 the data
show a clearer power law. By including only data with
N > 1000, we obtain α ≈ 0.43 and θ ≈ 118, which
are compatible with the results obtained from the whole
available range of genome sizes (without any cutoff). We
will use the following estimates for all forthcoming computations:

α = 0.43, θ = 121.

The theoretical mean-field curve for F(N) is plotted
against data in Fig. 5(A).
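As a quick numerical check of the class-count fit, the following minimal Python sketch evaluates a mean-field form of F(N). It assumes that the expression used for the fit can be written as αF(N) = (α + θ)(N + θ)^α / (1 + θ)^α − θ, which satisfies F(1) = 1; the function name and this explicit form are assumptions made for illustration, while α = 0.43 and θ = 121 are the estimates quoted above.

```python
def mean_field_classes(n, alpha=0.43, theta=121.0):
    """Mean-field number of homology classes F(N).

    Assumed form (illustrative, chosen so that F(1) = 1):
        alpha * F(N) = (alpha + theta) / (1 + theta)**alpha * (N + theta)**alpha - theta
    """
    prefactor = (alpha + theta) / (1.0 + theta) ** alpha
    return (prefactor * (n + theta) ** alpha - theta) / alpha


if __name__ == "__main__":
    # Print the predicted number of classes for a few genome sizes.
    for n in (1, 100, 1000, 10000):
        print(n, round(mean_field_classes(n), 1))
```

The sketch only evaluates the curve; it does not perform the fit, which would additionally require the empirical (N, F) pairs.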

Turning to the network data and the fit for L(N), a
non-null value of θ is not expected to modify the asymptotic
behavior, but to act only on the prefactors. Therefore
we use the mean-field L(N), even if the homology-class
fits give a non-negligible value of θ. Notice
that the same scaling seems to apply to the prokaryotic
genomes as well, despite their network dynamics
not being dominated by duplication-divergence; indeed,
in prokaryotes homology classes expand predominantly by
horizontal gene transfer [34]. A more precise analysis of this behavior
can only be carried out with more reliable and abundant
data; here we use both prokaryotic and eukaryotic data,
as described in Sec. III C. Fits against the mean-field predictions
for L(N) given in the Appendix (with α = 0.43
fixed) yield



σ = 0.457(10)

for variant [A], and

σ = 0.421(9) and σ = 0.460(10)

for variant [B], obtained at the two extreme values of γ; values of γ
between 0 and 1 give estimates between the two extremes. On the other hand,
a fit against the pure DD prediction (12) gives

σ = 0.446(10).

We tested the stability of the foregoing fits by increasing
the cutoff on the network size N from 10 to 100.
The values do not change appreciably; errors increase
by approximately 50%. Comparison between DIP data
[35] and simulations of variant [C], whose exact behavior
cannot be calculated via mean-field, gives approximately
σ ≈ 0.5, which corresponds to an effective link-retention
probability around 0.4 (see Fig. 4).

Figure 5(B) shows numerical results for the three variants
of our model (with α, θ, and σ fixed by the above
fits) superimposed on the data from DIP. The initial network
was chosen as the complete 3-graph (see Secs. IV A
and III A). Finite-size effects can be seen, especially
for the DDIW+EAP variant, but the trend is consistent.
The results for all parameters are compatible with
each other; we therefore regard this as a model-variant-independent
fit: the two parameters α and σ can then be
seen as "universal" (model-variant-independent) quantities
governing the scaling laws observed in genomes. Very
similar values of σ (≈ 0.4) were also found in [19] with a
more detailed analysis of the degree distribution of PPI
networks and comparison to the model. Note that a simple
fit of the form L(N) ~ N^{2σ} on the empirical data
Figure 4: (Color online) The scaling of the average total number of links L(N) as a function of network size is captured by
simple mean-field estimates for all model variants. Solid curves show the mean-field prediction, while symbols are numerical
results averaged over 100 realizations. (A) Variant DDIW+AP (anti-preferential attachment innovation move): the mean-field
estimate agrees with the simulation results. (B) Variant DDIW+EAP (anti-preferential attachment innovation move with
extensive number of links): deviations are present for the larger values of σ, but overall there is good agreement between mean-field
estimate and simulations. (C) A-DDIW (age-dependent duplication divergence): in this case, the mean-field estimates with
the same slope as in panel (A) are valid for simulations with a rescaled value of the parameter σ, related to link retention (note
that σ is not the link retention probability in this variant, see text).



would yield σ = 0.52, i.e., it would suggest a crossover
regime. According to our analysis, such a higher exponent
appears instead to be an artifact due to the combination
of two terms (N^{2σ} and N) with smaller exponents
but with opposite signs.
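The effect can be illustrated with a toy calculation. The sketch below takes a hypothetical two-term form L(N) = N − b N^{2σ} and measures its local log-log slope. The coefficient b, the unit prefactor of the linear term, and the choice of which term carries the minus sign are assumptions made only for illustration; they are not the coefficients of the mean-field result. Only σ = 0.457 is taken from the fits above.

```python
import math

SIGMA = 0.457  # fitted value quoted in the text
B = 1.0        # hypothetical coefficient, chosen only for illustration


def links(n):
    """Hypothetical two-term form: a linear term minus a term growing as N**(2*sigma)."""
    return n - B * n ** (2 * SIGMA)


def local_slope(f, n, eps=1e-3):
    """Local log-log slope d ln f / d ln N from a centered finite difference."""
    up, down = n * (1 + eps), n * (1 - eps)
    return (math.log(f(up)) - math.log(f(down))) / (math.log(up) - math.log(down))


if __name__ == "__main__":
    # The apparent exponent exceeds both 1 and 2*sigma over this range.
    for n in (100, 1000, 10000):
        print(n, local_slope(links, n))
```

In this toy form the local slope stays above 1 at finite N and approaches 1 only slowly, so a single power-law fit over a limited range would return an exponent larger than either of the two true exponents.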

Note that in principle the mean-field derivation is valid
in the large-N limit. Figure 5(B) shows that differences
in the fit results can be noticed for small cutoffs. We
chose a low cutoff, excluding only networks with fewer than 10 nodes, in
order to show this. It must be noted, however, that many
"small" networks are actually quite large in reality, but
extremely under-sampled in the data set.



C. Comparison with the empirical network of yeast 
reveals the necessity of age-dependent divergence. 

We now turn to the question of the topological properties
and the age dependency of interactions. In order
to perform a qualitative comparison between properties
of an empirical PPI network and the results of computer
simulations for the three model variants described above,
we choose the case of baker's yeast S. cerevisiae, where
reliable estimates of the age of nodes can be obtained
from the literature (see Sec. III C). As pointed out in
[25] and [35], while standard duplication-divergence network
growth models reproduce well the topological features
of protein-protein interaction graphs, such as degree distribution
and clustering coefficient, they fail to capture
the empirically observed correlation between the evolutionary
ages of interacting proteins. As discussed in those works, this
correlation might be obtained from an anti-preferential attachment
principle, if it becomes a dominant mechanism in defining
the network topology.

In order to monitor topology and history dependency
of interactions we considered the following observables.
(I) We measured two relevant topology-related quantities:
the degree distribution n_k, defined as the fraction of
Figure 5: (Color online) Universal behavior for the number
of homology classes and for the number of network links. (A)
Number of homology classes versus total number of proteins.
Symbols are data from the SUPERFAMILY database [36],
the line is a two-parameter (α, θ) fit from the model [equation (16)].
(B) Number of links versus size of sampled network.
Symbols are data from the DIP dataset [35], lines are
the results of simulations for the three variants of the model,
with all parameters fixed by fits. Darker triangles point out
some examples of well-known genomes. Note that many networks
(e.g. B. subtilis and all the triangles of smaller size)
are heavily undersampled in the data set (see Sec. V).



nodes of degree k, and a measure of the degree-degree correlation,
called d_k, defined as the average over all nodes
of degree k of the mean degree of their neighbors. (II)
To check for age-age correlations, we employed the interaction
density D_{m,n} and the interaction density gradient
ΔD introduced in Sec. III C.
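The two topological observables in (I) can be computed directly from an edge list. The following is a minimal sketch in Python; the function name and the conventions of ignoring self-interactions and isolated nodes are assumptions of this illustration, not a description of the code used for the figures.

```python
from collections import defaultdict


def degree_observables(edges):
    """Return (n_k, d_k) for an undirected graph given as (u, v) pairs.

    n_k[k]: fraction of nodes with degree k.
    d_k[k]: average, over nodes of degree k, of the mean degree of their neighbors.
    Self-interactions are skipped and isolated nodes are not counted (assumptions).
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)
    degree = {node: len(nbrs) for node, nbrs in neighbors.items()}

    count = defaultdict(int)     # number of nodes with degree k
    nn_sum = defaultdict(float)  # sum over those nodes of their neighbors' mean degree
    for node, nbrs in neighbors.items():
        k = degree[node]
        count[k] += 1
        nn_sum[k] += sum(degree[w] for w in nbrs) / k

    n_nodes = len(degree)
    n_k = {k: c / n_nodes for k, c in count.items()}
    d_k = {k: nn_sum[k] / count[k] for k in count}
    return n_k, d_k
```

For a star with one hub and three leaves, `degree_observables([(0, 1), (0, 2), (0, 3)])` gives n_1 = 0.75, n_3 = 0.25, d_1 = 3 and d_3 = 1, i.e. a decreasing d_k.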

The behaviour of the observables considered is shown
in Fig. 6 for both the empirical PPI network of yeast
and numerical simulations of the DDIW model variants.
The model parameters are those obtained in Sec. IV B.
As we pointed out before, results for the age class corresponding
to the WGD in Fig. 6 should be interpreted with care,
since homologues in that class were duplicated in
a phenomenologically different event. To assess how
successful each variant is in reproducing the degree distribution
and the degree correlation, we adopt a qualitative
criterion. Specifically, we consider a monotonically decreasing
behavior of the two topological quantities to be
compatible with empirical data, since this is the behavior
observed in yeast. Concerning node-age correlations,
we measure the interaction density gradient and verify
whether it is positive or negative; the reference data for
yeast give a positive ΔD. The whole comparison is carried
out in the same spirit as in Ref. [26].

The DDIW+AP variant successfully reproduces the
empirical degree correlation and degree distribution, but
not the pattern of correlation between age groups (ΔD < 0).
In this model, the innovation move gives a negligible
contribution to network topology, because the corresponding
number of links is always subdominant. In
fact, we verified that changing the anti-preferential attachment
innovation move into preferential attachment
has little or no effect on the main topological observables.
As expected from this argument, this model generates a
network where new nodes are preferentially connected
to old nodes, contrary to the pattern that emerges in
yeast, and just as in pure duplication-divergence
network growth. However, the anti-preferential mechanism
is capable of generating a qualitatively correct age
correlation if it can build a large number of links, i.e.
if the extensive variant DDIW+EAP is considered. In
this case, due to the progressive increase in the number
of links attached in the innovation move, one obtains
the correct empirical age dependency (ΔD > 0), but at
the expense of completely disrupting the topology. For
this variant, a scatterplot of the degree-degree correlation
(not shown here) presents a slight bimodality in a
small range of degrees; we chose nonetheless to group the
data in histograms, in order to highlight how the overall
behavior is different from the empirical one. Finally, the
age-dependent DDIW variant is able to account both for
the topological features and for the age correlation.

As mentioned above, we have also tested the robustness
of the results under further modifications of the innovation
move. No relevant change in the results for
variant [A] is detectable when applying a preferential attachment
principle instead of an anti-preferential one, or when
attaching the new node to a fixed number (> 1) of existing
nodes. Moreover, variant [B] yields very similar results
for all values of γ in (0, 1], and therefore the actual value
of this parameter should not be regarded as an essential
quantity. As far as the age-preference variant [C] is concerned,
we remark that an anti-preferential wiring move
gives the clearest results, but age-age correlations can also be
seen in networks obtained by means of preferential-attachment
wiring, as long as this does not dominate over
the duplication-divergence move.



V. DISCUSSION AND CONCLUSIONS 

The model presented here can be seen as the proto- 
type of a rather general modeling framework where a 
graph grows by the addition of nodes and links within 
Figure 6: Qualitative comparison between model variants and empirical data. The average degree of nearest neighbors
(top row), the degree distribution (middle row), and the interaction density [equation (5)] between age groups (bottom row)
are measured for S. cerevisiae (left panel) and for simulations of the three variants of the DDIW model (right panel). The
DDIW+AP variant successfully reproduces the empirical degree correlation and degree distribution, but the wiring mechanism
does not provide enough links to reproduce the empirical age correlation; the DDIW+EAP variant correctly shows correlation
between protein ages, thanks to the increased number of links introduced by innovation, but it strongly distorts the topological
features of the network; the A-DDIW variant effectively reproduces both topological and age-correlation features observed in
the empirical network.



the constraint of a class structure. Indeed, new nodes 
are added to a new class or to an existing one with pre- 
scribed probabilities, their wiring rules being different in 
the two cases. Here, we explored variants where nodes 
added together with a new class are wired to the old net- 
work according to an anti-preferential attachment prin- 
ciple, while nodes introduced into an existing class follow 
a duplication-divergence prescription. The goals of our 
work were twofold. First, we studied the joint evolution 
of the network by duplication/divergence and class ex- 
pansion/innovation. Second, as a case study and proof- 
of-principle application, we applied the unified frame- 
work to the study of age-dependence, where some inter- 
esting questions are open. The two objectives are con- 
nected, as the scenarios we explored would be ill-defined 
outside of this unified framework. For example, assigning
anti-preferential attachment to the innovation move
requires being able to distinguish it from a duplication
move, i.e. to separate new families from existing ones.
To carry out both objectives, we stayed as close as pos- 
sible to empirical data. 

We considered probabilities of addition of new nodes
that vanish as N → ∞, in order to reproduce the observed
empirical scaling of homology classes [7]. As a
consequence, unless it is imposed that new nodes (i.e.
nodes belonging to a new homology class) carry
an extensive number of links, the wiring rule for innovation
is of secondary importance with respect to the
duplication-divergence move in determining the asymptotic
features of the resulting graph ensemble. This is
in accordance with the empirical observations indicating
that duplication-divergence is relevant in shaping the appearance
of the PPI network [TU 03 [TS1 125 . The finite-size
behavior, nonetheless, is sensitive to the innovation
process, suggesting the existence of non-trivial features of
the topology related to the dynamics of homology classes.

Following these indications, the framework considered 
here can in principle make more detailed predictions for 
observables that involve network and homology classes 
jointly. We analyzed the behavior of one such observable, 
namely the correlation between the total number of links 
originating from a given class and the size of the class. 
While we find good agreement between data for the E. 
coli PPI network and simulations of the DDIW model 
(at least for the two non-extensive variants), they both 
agree with the null expectation that this scaling is linear 
Figure 7: (Color online) Linear scaling of the correlation between
the total number of links originating from a given class
and the size of the class: scatterplot of class degree (sum of
the degrees of all nodes in a class) versus class size (number
of nodes). Red crosses (+) represent results from a typical
realization of the DDIW model with N = 2640 nodes,
as in E. coli's PPI network (for the DDIW+AP variant; A-DDIW
yields a very similar plot); green crosses (x) are
obtained from the same DDIW realization by randomly permuting
nodes between classes; blue diamonds are obtained by
combining data for network structure and homology classes
for E. coli; the dashed line is the prediction for the average of
the total class degree in the randomized case, i.e., the mean
node degree times the class size (here ⟨k⟩ = 2.3).



(see Fig. 7). Indeed, in the random case, i.e. when the
members of homology classes (of prescribed sizes) are
chosen randomly among network nodes, the total degree
of a class will be, on average, equal to the number of
nodes in the class times the mean node degree. Thus, we
were unable to find a non-trivial joint effect of this kind in the data available
to us.

Despite the relation between class size and total degree
not being discriminating, the DDIW model does
generate nontrivial correlations from the joint evolution
of the network and the partitioning into classes. The
fact that currently available empirical data do not allow
one to discriminate should not, in our opinion, discourage
the analysis of joint models until more abundant or
precise data become available. To give an example, let
us focus on the number F_N(1,1) of classes containing a
single node with degree 1 in a network of size N. In
the null model where nodes are shuffled randomly between
classes (in a single realization of the network), this
number follows a hypergeometric distribution
with mean ⟨F_N(1,1)⟩ = MC/N, where M is
the number of degree-one nodes in the whole network,
and C is the number of size-one classes. Simulations of
the DDIW model for several realizations (in the two non-extensive
variants) consistently yield values of F_N(1,1)
that lie several standard deviations above the mean of
the null-model distribution. We have measured M, C, N,
and F_N(1,1) for the E. coli PPI network, using both Ensembl
[40] and SUPERFAMILY [36] homology data; the actual
value of F_N(1,1) is larger than the null average, in
both datasets, by approximately 4 to 6 standard deviations,
thus confirming the qualitative non-null prediction
of the model. Future work could be directed towards a
more detailed study of joint laws such as this one. As
an example, the full numbers F_N(i, k) of classes containing
i nodes of total degree k are a class of interesting
observables which are probably accessible by standard
mean-field techniques.
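The comparison with the shuffled null model can be made concrete with a few lines of Python. The sketch below uses the standard mean and variance of the hypergeometric distribution (population N, M degree-one nodes, C size-one classes drawn without replacement); the function names are hypothetical, and any numbers passed to them are placeholders rather than the measured E. coli values.

```python
import math


def null_f11(n_nodes, m_degree_one, c_singletons):
    """Mean and standard deviation of the number of size-one classes whose
    single node has degree one, when nodes are shuffled between classes."""
    n, m, c = n_nodes, m_degree_one, c_singletons
    mean = m * c / n
    variance = c * (m / n) * (1 - m / n) * (n - c) / (n - 1)
    return mean, math.sqrt(variance)


def z_score(observed, n_nodes, m_degree_one, c_singletons):
    """Deviation of an observed count from the null mean, in units of the null std."""
    mean, std = null_f11(n_nodes, m_degree_one, c_singletons)
    return (observed - mean) / std
```

A call such as `z_score(observed, N, M, C)` then expresses the excess of single-node, degree-one classes in units of null standard deviations, which is the quantity quoted in the text.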

The model variants can be approached by analytical 
estimates and direct simulation, and matched with em- 
pirical data on both homology classes and PPI networks. 
This fitting procedure constitutes a proof of principle of 
the general applicability of the framework defined here. 
It also allows one to fix the few parameters of the model, and
produces well-defined comparisons of the model's predictions
with data.

In order to explicitly carry out this comparison in a
specific case study, we considered the problem of reproducing
the empirical age dependency of PPI network interactions
through different variants of the model. We
tested the predictions obtained against data from yeast,
where both PPI network and gene duplications are well
characterized, and the duplication age of individual proteins
is also available. We were able to show that the
empirical duplication age patterns of interacting protein
pairs can be reproduced in two alternative ways. First,
by an anti-preferential attachment prescription in the innovation
move, associated with a heavy (extensive) contribution
of this move to the number of links. Second, by
inserting a strong negative bias towards forming protein-protein
interactions with old nodes. However, the first
choice leads to networks whose degree distribution and
neighbour degree correlations do not resemble the empirical
ones. Conversely, the bias imposed in the second case
could be rationalized by biological arguments concerning
the available binding interfaces (older proteins are more
likely to be fully engaged in the interactions they participate
in) and the conservation of basic biological functions
(new interactions interfering with older ones could
be detrimental). Thus, an age-dependent duplication-divergence
move seems more satisfactory. Once it is established
that such an age dependence in the divergence
process is in qualitative agreement with data, one can
ask whether the same features can be reproduced without
considering the full partition/topology dynamics.
We have performed additional numerical simulations and
found that the qualitative patterns in Fig. 6 can also be
reproduced by a simple duplication-divergence model
with age bias and without innovation or class dynamics. This
is not in contrast with the importance of considering the
problem in the more general framework, since in principle
(as we have explained in the previous section)
other mechanisms, related to the innovation/wiring
move, could have been responsible for the age correlation
patterns observed.
Overall, our analysis tends to support the hypothesis
that duplication-divergence alone does not account for
the observed history dependency of the existing protein-protein
interactions [25]. Note, however, that in the
age-dependent DDIW model, as well as in the previous
models of this kind, duplication-divergence turns
out to be a necessary ingredient in shaping biologically
realistic degree distributions and degree correlations
of nearest neighbors. This suggests that the mechanism
of duplication and divergence might play a role in determining
PPI network topologies [25]. Conversely, in
the previous model of Kim and Marcotte, the age dependency
is associated with model moves that, roughly
speaking, are more similar to an anti-preferential attachment
innovation move than to a duplication-divergence
one [35]. We should also remark that the models we
have explored here are based on totally asymmetric
duplication-divergence. We cannot exclude that the age-correlation
patterns could also be biased by using general
duplication-divergence schemes [31], where different
values of σ are assigned to the connections between
pairs of new nodes with respect to new-old node pairs.
In this case, the introduction of an additional parameter
could produce the age correlation kernel in a natural way.
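For concreteness, a totally asymmetric duplication-divergence move can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulation code used for the figures: the copy keeps each of the parent's links independently with probability σ, the parent keeps all of its links, no link is placed between parent and copy, and innovation, class dynamics, and age bias are all omitted. The function names are hypothetical; the triangle start mirrors the complete 3-graph mentioned in Sec. IV B.

```python
import random


def grow_dd(n_final, sigma, seed=0):
    """Grow a graph by totally asymmetric duplication-divergence.

    Returns the adjacency as a dict mapping each node to the set of its neighbors.
    """
    rng = random.Random(seed)
    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}  # complete 3-graph
    while len(adj) < n_final:
        parent = rng.randrange(len(adj))  # uniformly chosen node to duplicate
        new = len(adj)
        # The copy retains each parental link with probability sigma.
        kept = {w for w in sorted(adj[parent]) if rng.random() < sigma}
        adj[new] = kept
        for w in kept:
            adj[w].add(new)
    return adj


def n_links(adj):
    """Total number of links (each link is stored twice in the adjacency)."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2
```

In this sketch each step adds on average σ times the mean degree 2L/N, which is the origin of the N^{2σ} term in the mean-field scaling; a general scheme would introduce a second retention parameter for the parent's links.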

One important caveat is that the PPI data available
to us are affected by strong sub-sampling problems,
since presumably for most organisms only a fraction of
the protein-protein interactions are available in the DIP
database [35]. Having small samples of large networks
makes it problematic to estimate model parameters. For
example, it is likely that the exponent for L(N) is overestimated.
We performed a numerical test by growing
networks up to size N (and a fluctuating number of links
L′) and subsampling them to a fixed number of links
L. In general one obtains networks with many more
nodes (N′) compared to networks that are grown with
the model at L′ edges and not subsampled. For parameter
values that match the available data, this error
could be as large as 100%; in C. elegans, for instance,
for which approximately 4000 interactions are known, involving
around 2600 proteins (out of ~20000 genes), we
obtain N′ ≈ 5100. On the positive side, restricting the
parameter-matching analysis (Fig. 5) of the model to
the few highly sampled genomes does not change our results.
Nevertheless, it seems quite possible that a larger
cross-genomic knowledge of PPI networks could change
the quantitative picture emerging from these data, and
possibly also the qualitative one.
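The subsampling step of such a test can be sketched in a few lines. The sampling scheme is not specified in the text, so the sketch assumes that links are retained uniformly at random; the function name is hypothetical.

```python
import random


def subsample_links(edges, n_keep, seed=0):
    """Keep n_keep links chosen uniformly at random (assumed sampling scheme).

    Returns the sampled links and the number of nodes they touch, i.e. the
    apparent size of the subsampled network.
    """
    rng = random.Random(seed)
    sampled = rng.sample(list(edges), n_keep)
    nodes = {node for edge in sampled for node in edge}
    return sampled, len(nodes)
```

Applied to the edge list of a simulated network, the second return value is the apparent number of nodes of the subsampled network, which can then be compared with the size of a network grown directly to the same number of links.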

To conclude, despite the current open questions, we
believe that this general framework might be important
for posing questions about the growth of PPI networks, as
the network structure is intimately related to the partitioning
into homology classes and, quite importantly, to
the class of biological functions that a specific homology
class can perform [41].
Appendix: Mean-field calculation of L(N)

We give here the solutions to the mean-field equation
(13). Let us call L_A(N) the solution with the choice
l(N) = 1 (variant [A]) and L_B(N) the solution with the
choice l(N) = γ 2L/N (variant [B]). For both choices, (13)
is a standard first-order ordinary differential equation,
whose solution can be readily computed with the help of
Mathematica. One obtains



iV 2 V CTP °w|const. + ^ 
where P_α(N) is defined as
P a (N) = 



-N 



a-l 



(A.l) 



(A.2) 



[which is proportional to the asymptotic form of the innovation
probability, see Eq. Q], and Γ(a, z) is the upper
incomplete gamma function

$$\Gamma(a, z) = \int_{z}^{\infty} t^{a-1} e^{-t} \, dt . \qquad (A.3)$$



The constant term depends only on α, σ, and the initial
condition L(N_0) = L_0. Notice that P_α(N) → 0 when
N → ∞, since α ∈ (0, 1). By substituting the asymptotic
expansion for the incomplete gamma function to leading order
around z = 0,

$$\Gamma(a, z) \simeq \Gamma(a) - \frac{z^{a}}{a} , \qquad (A.4)$$



into (A.1), one sees that the first term in curly brackets
gives a contribution ∝ N^{2σ} to the asymptotic form, while
the second and third terms have a linear behavior ∝ N,
thus recovering expression (14).

An expression similar to equation (A.1), but more complicated,
is found for L_B(N), the solution for variant [B]; we do not
quote it here because it is very lengthy without being particularly
instructive; the same analysis gives the corresponding asymptotic
behavior (15).
Organism                      # Nodes   # Links
-----------------------------------------------
Sulfolobus solfataricus            14         9
-----------------------------------------------
Arabidopsis thaliana              136       153
Bos taurus                         30        23
Caenorhabditis elegans           2647      3985
Chlamydomonas reinhardtii          14        17
Danio rerio                        13         9
Drosophila melanogaster          7500     22737
Gallus gallus                      11         6
Homo sapiens                     1850      2370
Mus musculus                      524       457
Pisum sativum                      10        12
Rattus norvegicus                 147       112
Saccharomyces cerevisiae         4998     21881
Schizosaccharomyces pombe          80       160
Xenopus laevis                     20        14
-----------------------------------------------
Bacillus subtilis                  34        24
Caulobacter crescentus             18        11
Escherichia coli                 2640     11545
Helicobacter pylori               700      1354
Mycobacterium tuberculosis         13         9
Synechocystis sp.                  32        29
Xanthomonas campestris             11        10

Table I: Genomes from DIP [35] and corresponding values of
the number of nodes N and number of links L. Archaea on
top, then eukaryotes, then bacteria (separated by horizontal
lines). Note that the bacteria with small number of nodes are
heavily undersampled in the data set, so that the number of
effectively significant points is low (see Sec. V).